15 research outputs found

    Recognition and Prediction for Implicit Contrastive Focus in Romanian

    This paper has two parts. The first part continues the theoretical investigation of Information Structure (IS) and of the linguistic and computational approaches suited to the prosody prediction problem for the Implicit Contrastive Focus (ICF) concept introduced in our previous papers. ICF is meant to be a particular case, but also the counterpart, of the classical category of contrastive Focus at the finite clause level, as the second item in the Background-Focus pair of the IS dimension. The classical contrastive Focus, which we call Explicit Contrastive Focus (ECF), is the intonationally F-marked entity introduced by overt lexical contrastive markers. ICF labels the situations where contrastive intonational focusing occurs without the lexical presence of contrastive Focus markers. The sole device for introducing contrastive focusing on certain constituents is their syntactic dislocation from their specific positions in the Systemic Ordering (SO) of syntactic-semantic roles for the Romanian finite clause. The ICF problem consists in obtaining reliable algorithms and procedures at the Discourse-Prosody interface that accurately predict the contrastive Focus distribution within the Romanian ICF-type affirmative finite clause. The second, applicative part of the paper describes algorithms for solving the ICF problem for Romanian, which try to exploit the typically dislocated constituents in the finite clause and to predict their Prosodic Prominence (PP). Procedures for developing the intonational-prosodic patterns assigned to the ICF distribution by certain ICF estimation schemes are developed and tested on a balanced set of Romanian ICF-type affirmative finite clauses.

    CoRoLa Starts Blooming – An update on the Reference Corpus of Contemporary Romanian Language

    This article reports on the ongoing CoRoLa project, which aims to create a reference corpus of contemporary Romanian (from 1945 onwards), open for free online use by researchers in linguistics and language processing, teachers of Romanian, and students. We are investing serious effort in persuading large publishing houses and other owners of intellectual property rights (IPR) on relevant language data to join us and contribute selections of their text and speech repositories to the project. The CoRoLa project is coordinated by two Computer Science institutes of the Romanian Academy, and benefits from the cooperation and advice of professional linguists at other institutes of the Romanian Academy. We foresee a written component of more than 500 million word forms and a speech component of about 300 hours of recordings. The entire collection of texts (covering all functional styles of the language) will be pre-processed, annotated at several levels, and documented with standardized metadata. Pre-processing includes cleaning the data, harmonising the diacritics, sentence splitting, and tokenization. Annotation will include morpho-lexical tagging and lemmatization in the first stage, followed by syntactic, semantic, and discourse annotation in a later stage.

    17th-century Romanian lexical resources and their influence on Romanian written tradition

    This paper focuses on the first Slavonic-Romanian lexicons, compiled in the second half of the 17th century, and on their use and users. It proposes a method for investigating how the lexical information available in this corpus relates, if at all, to the vocabulary of texts from the same period. We chose to investigate their relation to an anonymous Old Testament translation made from Church Slavonic, also from the second half of the 17th century, which is presumed to have been produced in the same geographical area, in the same Church Slavonic school, or even by the same author as the lexicons. After applying a lemmatizer to both the Biblical text (Books of Genesis and Daniel) and the Romanian material from the lexicons, we analyse the results and complement the statistical analysis with a series of case studies, focusing on some common lexemes that might indicate the relatedness of the texts. Even though the analysis suggests that the lexicons might not have been compiled as a tool for translating religious texts, the method proves useful: it reveals interesting data and provides the basis for more extensive approaches.

    The first Romanian dictionaries (17th century). Digital aligned corpus

    This paper presents the project “The first Romanian bilingual dictionaries (17th century). Digitally annotated and aligned corpus” (eRomLex), which deals with the editing of the first bilingual Romanian dictionaries. The aim of the project is to compile an electronic corpus comprising six Slavonic-Romanian lexicons dating from the 17th century, selected for their relatedness and for the fact that they follow a common model, in order to highlight the characteristics of this lexicographical network (the affiliations between the lexicons, the way they relate to the source, their innovations with respect to it, their potential uses) and to facilitate access to their content. A digital edition allows exhaustive data extraction and comparison, as well as linking with other digitized resources for old Romanian or Church Slavonic, including dictionaries. After presenting the corpus, we point to the stages necessary to achieve this project, the techniques used to access the material, and the challenges and obstacles we encountered along the way. We describe how the corpus was created, stored, and indexed, and how it can be searched; we also present and discuss some statistical analyses highlighting relations between the Romanian lexicons and their Slavonic-Ruthenian source.